---
title: "Project Dashboard"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: scroll
source_code: embed
theme: yeti
---
```{r setup, include=FALSE,warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(GGally) #v2.1.2
library(ggcorrplot) #v0.1.4
library(MASS) #v7.3-58.2
library(flexdashboard) #v0.6.0
library(plotly) #v4.10.1
library(crosstalk) #v1.2.0
library(knitr) #v1.42 kable()
library(tidymodels)
#library(parsnip) #v1.1.0 linear_reg(), set_engine(), set_mode(), fit(), predict()
#library(yardstick) #v1.2.0 metrics(), roc_auc(), roc_curve(), metric_set(), conf_mat()
#library(dplyr) #v1.1.2 %>%, select(), select_if(), filter(), mutate(), group_by(),
#summarize(), tibble()
#library(ggplot2) #v3.4.2 ggplot()
#library(broom) #v1.0.5 for tidy(), augment(), glance()
#library(rsample) #v1.1.1 initial_split()
```
```{r load_data}
#Load the data
df <- read.csv("Chennai houseing sale.csv")
#creating categorical variable for sales price, dummy variable for paved street & making sure both are factors
#removing variables we will not be using for analysis
# removing QS variables bc masked data
# removing registration fee & commission because those are dependent on sales price
df <- df %>%
dplyr::select(-QS_ROOMS,-QS_BATHROOM,-QS_BEDROOM,-QS_OVERALL,
-COMMIS,-REG_FEE,-MZZONE,-PRT_ID,-DATE_BUILD,-DATE_SALE)
df <- df %>%
mutate(SALES_PRICEhigh =
factor(if_else(SALES_PRICE>10000000,"Yes","No"),levels=c("Yes","No")),
PAVED =factor(if_else(STREET=="Paved","Yes","No"),levels=c("Yes","No")),
BUILDTYPE = factor(BUILDTYPE),
UTILITY_AVAIL = factor(UTILITY_AVAIL),
SALE_COND = factor(SALE_COND),
PARK_FACIL = factor(PARK_FACIL),
AREA = factor(AREA)) %>%
select(-STREET)
df <- df %>%
  #collapse the less common areas into "Other"
  mutate(AREA = if_else(as.character(AREA) %in%
                          c("Chrompet","Karapakkam","KK Nagar","Velachery","Anna Nagar"),
                        as.character(AREA), "Other"))
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=600}
-----------------------------------------------------------------------
### The Project
#### Executive Summary
This project examines house sale data from Chennai, India. The goal is to predict the house sale price. First, we examine the distribution of the variables and look for relationships. Next, we perform regression analysis to predict the house sale price. We then perform classification analysis to predict whether a home has a high or low sale price. Finally, we summarize our conclusions.
The **best regression model was the tuned regression tree** with an *R-squared of 92%*, and the **best classification model was the classification tree** with a *sensitivity of 91%*.
We find that the variables that decrease home sale price are:
- Number of Bedrooms (N_BEDROOM)
- Number of Bathrooms (N_BATHROOM)
- The build house type (BUILDTYPE)
The variables that increase home sale price are:
- Interior square footage (INT_SQFT)
- Number of Rooms (N_ROOM)
- If there is Parking (PARK_FACIL)
#### The Problem Description
This project examines housing data from Chennai, India. We will perform both regression and classification analysis. The goal of the **regression models is to predict the house sale price** using the variables in the dataset. First, we examine the distribution of the variables and look for relationships. Next, we perform regression analysis predicting the house sale price for the areas, using a variety of methods including **linear regression, regression trees, and lasso regression**. Then we perform **classification analysis predicting whether an area has a high or low house sale value**, using both **logistic regression and classification trees.** Finally, we summarize our conclusions about which variables help to predict the house sale value.
#### The Data
This data set has 7109 rows and 16 variables.
#### Data Sources
https://www.kaggle.com/datasets/kunwarakash/chennai-housing-sales-price
### The Data
VARIABLES TO PREDICT WITH
* **AREA**: The area a house is located in Chennai
* **INT_SQFT**: square footage of house
* **DIST_MAINROAD**: distance of house from main road
* **N_BEDROOM**: number of bedrooms
* **N_BATHROOM**: number of bathrooms
* **N_ROOM**: number of rooms
* **SALE_COND**: condition of house at time of sale
* **PARK_FACIL**: is parking available or not (Yes or No)
* **BUILDTYPE**: purpose of house
* **UTILITY_AVAIL**: facilities available at house
* **PAVED**: If street accessing home is paved or not (Yes or No)
VARIABLES WE WANT TO PREDICT
* **SALES_PRICE**: Sale price of house
* **SALES_PRICEhigh**: Sale Price > $10M coded as Yes, lower coded as No
Data Exploration {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=200}
-------------------------------------
### Data Overview
From this data we can see that our variables have a variety of different values based on their types. The homes were built between 1967 and 2002, and all sales occurred between 2005 and 2014. The top two areas are Chrompet and Karapakkam, which together hold about 3/7 of the houses. The median square footage is 1373 and the max is 2500. We can also see that there are many typos in the data. For example, `PARK_FACIL` (yes or no, whether there is parking at the house) has both "No" and "Noo". Similar inconsistencies appear in other categorical variables. In this data, remember `SALES_PRICEhigh` is just a categorical variable that is Yes if the sale price is high (> $10M).
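These typo levels could be collapsed before modeling. A minimal base-R sketch (illustrative only; this dashboard keeps the raw levels, which is why dummies such as `PARK_FACIL_Noo` appear in the model output):

```r
# Toy vectors standing in for the raw columns, with the typo levels
# seen in the data ("Noo", "Comercial", "Others")
park  <- factor(c("Yes", "No", "Noo", "Yes"))
build <- c("Commercial", "Comercial", "House", "Others", "Other")

# Collapse the "Noo" typo into "No"
park_clean <- factor(ifelse(park == "Noo", "No", as.character(park)),
                     levels = c("Yes", "No"))

# Map misspelled build types onto their intended levels
fixes <- c("Comercial" = "Commercial", "Others" = "Other")
build_clean <- ifelse(build %in% names(fixes), fixes[build], build)

table(park_clean)   # Yes: 2, No: 2
table(build_clean)  # Commercial: 2, House: 1, Other: 2
```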
Column {data-width=450, data-height=600}
-----------------------------------------------------------------------
### View the Data Summaries
Now we can see the range of values for each variable. The `AREA` (house neighborhood) variable is truncated, but we can see its values in the bottom table.
```{r, cache=TRUE}
#View data
summary(df)
```
Column {data-width=150, data-height=300}
-----------------------------------------------------------------------
### Average Sales Price by `PAVED` (Street accessing home is paved)
```{r, cache=TRUE}
#Summary table for chas variable
df %>%
group_by(PAVED) %>%
summarize(n=n(), mean(SALES_PRICE)) %>%
kable(digits=2)
```
### Average Sales Price by `AREA` (neighborhood house is in)
```{r, cache=TRUE}
df %>%
group_by(AREA) %>%
summarize(n=n(),mean(SALES_PRICE)) %>%
arrange(-n) %>%
kable(digits=2)
```
Data Visualization {data-orientation=rows}
=======================================================================
### Response Variables relationships with predictors
* We can see we have about half of the data as high sales price (>$10M). Looking at the potential predictors related to High Sales Price, we see the strongest relationships with square footage, number of rooms, and likely area.
* We see the largest concentration of values around $8M-$12M. The data is also skewed to the right. We can see a decent increase in number of values around $20M. This is actually due to truncation of the data.
Row {data-height=550}
-----------------------------------------------------------------------
#### High Sales Price
```{r, cache=TRUE}
ggplot(df,aes(x=SALES_PRICEhigh)) + geom_bar()
```
#### Sales Price
```{r, cache=TRUE}
ggplot(df, aes(SALES_PRICE)) + geom_histogram(bins=20)
```
Row {.tabset data-height=450}
-----------------------------------------------------------------------
### Sales Price vs Categorical Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(df,SALES_PRICE,PAVED, BUILDTYPE,
UTILITY_AVAIL,SALE_COND,PARK_FACIL, AREA))
```
### Sales Price vs Continuous Variables
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(df,SALES_PRICE,N_BEDROOM,N_BATHROOM,
DIST_MAINROAD,INT_SQFT,N_ROOM)))
```
### High Sales Price vs Continuous Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(df, SALES_PRICEhigh, N_BEDROOM, N_BATHROOM, DIST_MAINROAD, INT_SQFT, N_ROOM))
```
### High Sales Price vs Categorical Variables
```{r, cache=TRUE}
df %>% group_by(PAVED, SALES_PRICEhigh) %>%
summarize(n=n()) %>%
ggplot(aes(y=n, x=SALES_PRICEhigh,fill=PAVED)) +
geom_bar(position="dodge", stat="identity") +
geom_text(aes(label=n), position=position_dodge(width=0.9), vjust=-0.25) +
ggtitle("High Sales Price vs Paved Street Access") +
coord_flip() #makes horizontal
```
Regression Model {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
----------------------------------------------------------------------
### Predicting Continuous Sales Price
For the prediction of the continuous variable house sale price (SALES_PRICE), first we will use linear regression.
Row {data-height=2000 .tabset .tabset-fade}
-----------------------------------------------------------------------
### Linear Regression Full
#### Full Model Results
```{r, cache=TRUE}
reg_recipe <- recipe(SALES_PRICE ~ ., data = dplyr::select(df,-SALES_PRICEhigh)) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
prep()
df_reg_norm <- bake(reg_recipe, df)
#Define the model specification
reg_spec <- linear_reg() %>% ## Class of problem
set_engine("lm") %>% ## The particular function that we use
set_mode("regression") ## type of model
#Fit the model
reg1_fit <- reg_spec %>%
fit(SALES_PRICE ~ .,data = df_reg_norm)
#Capture the predictions and metrics
pred_reg1_fit <- augment(reg1_fit,df_reg_norm)
curr_metrics <- pred_reg1_fit %>%
metrics(truth=SALES_PRICE,estimate=.pred)
results_reg <- tibble(model = "Linear Model",
RMSE = curr_metrics[[1,3]],
MAE = curr_metrics[[3,3]],
RSQ = curr_metrics[[2,3]])
```
#### The Full Regression Model Coefficients
```{r, cache=TRUE}
tidy(reg1_fit) %>%
kable(digits=2)
```
#### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in predicting the house sale price, so a pruned version of the model is created by removing predictors that are not significant.
```{r, cache=TRUE}
results_reg %>%
kable(digits = 2)
```
### Linear Regression Final
For this analysis we will use a pruned linear regression model. We removed the distance from the main road (DIST_MAINROAD), the non-significant levels of the sale condition (SALE_COND) and available utilities (UTILITY_AVAIL), and whether the street in front of the house is paved (PAVED).
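The pruning list can be derived mechanically by flagging terms whose p-values exceed 0.05 in the full fit. A sketch of that step, shown on a built-in toy model rather than the real fit, so the exact terms flagged are illustrative:

```r
# Flag non-significant terms from a fitted model's coefficient table;
# on the real model this would be tidy(reg1_fit) %>% filter(p.value > 0.05)
fit <- lm(mpg ~ wt + qsec + gear, data = mtcars)
coefs <- summary(fit)$coefficients
drop_candidates <- rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]
drop_candidates
```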
#### Final Model Results
```{r, cache=TRUE}
reg2_fit <- reg_spec %>%
  fit(SALES_PRICE ~ . -DIST_MAINROAD -SALE_COND_AbNormal -UTILITY_AVAIL_NoSeWa -UTILITY_AVAIL_ELO
        -SALE_COND_Adj.Land -SALE_COND_Normal.Sale -SALE_COND_Partiall -SALE_COND_PartiaLl -PAVED_No,
      data = df_reg_norm)
#Capture the predictions and metrics
pred_reg2_fit <- augment(reg2_fit,df_reg_norm)
curr_metrics <- pred_reg2_fit %>%
metrics(truth=SALES_PRICE,estimate=.pred)
results_new <- tibble(model = "Linear Final Model",
RMSE = curr_metrics[[1,3]],
MAE = curr_metrics[[3,3]],
RSQ = curr_metrics[[2,3]])
results_reg <- bind_rows(results_reg, results_new)
reg2_mae <- curr_metrics %>%
filter(.metric=='mae') %>%
pull(.estimate)
```
#### The Final Regression Model Coefficients
```{r, cache=TRUE}
tidy(reg2_fit) %>%
kable(digits=2)
```
#### Residual Assumptions Explorations
```{r, cache=TRUE}
library(performance) #v0.10.0 check_model
reg2_fit %>%
check_model(check=c('linearity','qq'))
```
#### Compare actual (SALES_PRICE) vs predicted (y_hat) for pruned regression model
```{r, cache=TRUE}
#Plot the Actual Versus Predicted Values
ggplotly(ggplot(data = pred_reg2_fit,
aes(x = .pred, y = SALES_PRICE)) +
geom_point(col = "#6e0000") +
geom_abline(slope = 1) +
ggtitle(paste("Pruned Regression with MAE",round(reg2_mae,2))))
```
```{r, cache=TRUE}
results_reg %>%
kable(digits=2)
```
### Predicting Categorical Sales Price
Here is a look at a logistic regression model predicting high sales price.
```{r, cache=TRUE}
#Define the model specification
log_spec <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
#Fit the model
log_fit <- log_spec %>%
fit(SALES_PRICEhigh ~ .-SALES_PRICE, data = df)
#Capture the predictions and metrics
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)
pred_log_fit <- augment(log_fit, df)
tidy(log_fit$fit) %>%
kable(digits=3)
pred_log_fit %>%
my_class_metrics(truth=SALES_PRICEhigh,estimate=.pred_class) %>%
select(-.estimator) %>%
kable(digits = 3, align = 'l')
```
# Regression Tree Analysis {data-navmenu="Regression Models"}
Column {.sidebar data-width=520}
----------------------------------------------------------------------
#### Analysis Summary
After examining these two trees, we can see that **INT_SQFT** and **BUILDTYPE** are the most important variables for both the original and tuned trees. The next most important variables are **N_ROOM**, **N_BEDROOM**, and **AREA**. We can see that:
* If the **build type is commercial**, it **decreases the sales price**.
* If the **area is Velachery**, it **increases the sales price**.
Row {data-height=2000 .tabset .tabset-fade}
-----------------------------------------------------------------------
### Regression Tree
We will predict the home sale price with all the variables.
```{r, cache=TRUE}
#Define the model specification
tree_reg_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
#Fit the model
tree1_fit <- tree_reg_spec %>%
fit(SALES_PRICE ~ .,data = df_reg_norm)
#Capture the predictions and metrics
pred_tree1_fit <- augment(tree1_fit,df_reg_norm)
curr_metrics <- pred_tree1_fit %>%
metrics(truth=SALES_PRICE,estimate=.pred)
results_new <- tibble(model = "Reg Tree Model",
RMSE = curr_metrics[[1,3]],
MAE = curr_metrics[[3,3]],
RSQ = curr_metrics[[2,3]])
tree1_mae <- curr_metrics %>%
filter(.metric=='mae') %>%
pull(.estimate)
results_reg <- bind_rows(results_reg, results_new)
```
#### View the regression tree.
We see it has 8 leaf nodes.
```{r, cache=TRUE}
#only load packages not already attached in the setup chunk
library(rpart) #v4.1.19 Partition package to create trees
library(rpart.plot) #v3.1.1 creates nicer tree plots
library(vip) #v0.3.2 vip()
library(see) #v0.7.3 for check_model plots from performance
library(patchwork) #v1.1.1 for check_model plots from performance
rpart.plot(tree1_fit$fit, roundint=FALSE)
```
#### View the Variable Importance Plot
```{r, cache=TRUE}
vip(tree1_fit)
```
#### Compare actual (SALES_PRICE) vs predicted (y_hat)
```{r, cache=TRUE}
#Plot the Actual Versus Predicted Values
ggplotly(ggplot(data = pred_tree1_fit,
aes(x = .pred, y = SALES_PRICE)) +
geom_point(col = "#6e0000") +
geom_abline(slope = 1) +
ggtitle(paste("Regression Tree with MAE",round(tree1_mae,2))))
```
#### Compare the Metrics
```{r, cache=TRUE}
results_reg %>%
kable(digits=2)
```
### Tuned Regression Tree
Will tuning improve performance? We'll use cross validation on the cost complexity and the tree depth.
```{r, cache=TRUE}
#Define the model specification
tree_tune_spec <- decision_tree(cost_complexity = tune(),
tree_depth = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
df_folds <- vfold_cv(df_reg_norm)
tree_grid <- dials::grid_regular(cost_complexity(),
tree_depth(range = c(2, 6)),
levels = 5)
tree2_wf <- workflow() %>%
add_model(tree_tune_spec) %>%
add_formula(SALES_PRICE ~ .)
#Tune on the grid of values
tree2_rs <- tree2_wf %>%
tune_grid(resamples = df_folds,
grid = tree_grid)
#finalize the workflow
final_tree_wf <-
tree2_wf %>%
finalize_workflow(select_best(tree2_rs, metric='rmse'))
final_tree_fit <-
final_tree_wf %>%
fit(data = df_reg_norm) %>%
extract_fit_parsnip()
#Capture the predictions and metrics
pred_tree2_fit <- augment(final_tree_fit,df_reg_norm)
curr_metrics <- pred_tree2_fit %>%
metrics(truth=SALES_PRICE,estimate=.pred)
results_new <- tibble(model = "Tuned Reg Tree Model",
RMSE = curr_metrics[[1,3]],
MAE = curr_metrics[[3,3]],
RSQ = curr_metrics[[2,3]])
tree2_mae = curr_metrics %>%
filter(.metric=='mae') %>%
pull(.estimate)
results_reg <- bind_rows(results_reg, results_new)
```
```{r, cache=TRUE}
final_tree_fit$spec
```
#### View the regression tree.
We see it has 25 leaf nodes.
```{r, cache=TRUE}
rpart.plot(final_tree_fit$fit, roundint=FALSE)
```
#### View the Variable Importance Plot
```{r, cache=TRUE}
vip(final_tree_fit)
```
#### Compare actual (SALES_PRICE) vs predicted (y_hat) for tuned tree
```{r, cache=TRUE}
ggplotly(ggplot(data = pred_tree2_fit,
aes(x = .pred, y = SALES_PRICE)) +
geom_point(col = "#6e0000") +
geom_abline(slope = 1) +
ggtitle(paste("Regression Tuned Tree with MAE",round(tree2_mae,2))))
```
#### Compare the Metrics
```{r, cache=TRUE}
results_reg %>%
kable(digits=2)
```
Classification Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Classification Models
When predicting the high/low house sale price variable (SALES_PRICEhigh), we have coded it so that Yes means the price is high (> 10,000,000) and No otherwise. For this analysis we will fit a classification tree and a logistic regression. **Both models have a sensitivity of around 91%**. If I had to choose a single model, I would choose the classification tree since it is easier to explain.
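As a minimal sketch of the high/low coding described above (using made-up prices rather than the Chennai data), the threshold rule can be written directly:

```{r}
# Illustrative sketch: coding a high/low factor like SALES_PRICEhigh
# (made-up prices; threshold of 10,000,000 as stated above)
prices <- c(5e6, 12e6, 9.5e6, 15e6)
high <- factor(ifelse(prices > 1e7, "Yes", "No"), levels = c("Yes", "No"))
table(high)
```

Putting "Yes" first in the factor levels matters: yardstick treats the first level as the event by default, so sensitivity is computed for the high-price class.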
Row {data-height=2500 .tabset .tabset-fade}
-------------------------------------
### Classification Trees
We will use all the variables except SALES_PRICE, since SALES_PRICEhigh is derived from it. For this model we will set the cost complexity to 0.001.
```{r, cache=TRUE}
class_recipe <- recipe(SALES_PRICEhigh ~ ., data = dplyr::select(df,-SALES_PRICE)) %>%
step_normalize(all_numeric()) %>%
prep()
df_class_norm <- bake(class_recipe, df)
tree_class_spec <- decision_tree(cost_complexity=.001) %>%
set_engine("rpart") %>%
set_mode("classification")
#Fit the model
class_tree1_fit <- tree_class_spec %>%
fit(SALES_PRICEhigh ~ .,data = df_class_norm)
#Capture the predictions and metrics
pred_class_tree1_fit <- augment(class_tree1_fit,df_class_norm)
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)
curr_metrics <- pred_class_tree1_fit %>%
my_class_metrics(truth=SALES_PRICEhigh,estimate=.pred_class)
results_cls <- tibble(model = "Classification Tree Model",
Accuracy = curr_metrics[[1,3]],
Sensitivity = curr_metrics[[3,3]],
Specificity = curr_metrics[[2,3]])
class_tree1_sens <- curr_metrics %>%
filter(.metric=='sens') %>%
pull(.estimate)
```
#### Variable Importance
Here we view the variable importance measures. The higher the value, the more important the variable.
```{r, cache=TRUE}
library(vip) #v0.3.2 vip()
vip(class_tree1_fit)
```
#### View the Classification Tree Plot
We can see we have 5 leaf nodes.
```{r, cache=TRUE}
rpart.plot(class_tree1_fit$fit, type=1, extra = 102, roundint=FALSE)
```
#### Confusion matrix
```{r, cache=TRUE}
pred_class_tree1_fit %>%
conf_mat(truth=SALES_PRICEhigh,estimate=.pred_class)
```
#### View the Metrics
```{r, cache=TRUE}
results_cls %>%
kable(digits = 2, align = 'l')
```
#### Checking the Cutoff
```{r, cache=TRUE}
#Find Best Threshold cutoff
ROC_threshold <- function(pred_data,truth,probs) {
#This function finds the cutoff with the max sum of sensitivity and specificity
#Created tidy version of:
#http://scipp.ucsc.edu/~pablo/pulsarness/Step_02_ROC_and_Table_function.html
#The inputs are the prediction table (from augment()) and the columns for the
#truth and estimate values. The columns need to be strings (i.e., 'sales')
roc_curve_tbl <- pred_data %>%
roc_curve(truth = {{truth}}, {{probs}})
auc = pred_data %>%
roc_auc(truth = {{truth}}, {{probs}}) %>%
pull(.estimate)
best_row = which.max(roc_curve_tbl$specificity + roc_curve_tbl$sensitivity)
print(paste("Best Cutoff", round(roc_curve_tbl[best_row,'.threshold'],4),
"Sensitivity", round(roc_curve_tbl[best_row,'sensitivity'],4),
"Specificity", round(roc_curve_tbl[best_row,'specificity'],4),
"AUC for Model", round(auc,4)))
}
ROC_threshold(pred_class_tree1_fit,'SALES_PRICEhigh', '.pred_Yes')
#Adding a new cutoff prediction column (60% cutoff)
pred_class_tree1_fit <- pred_class_tree1_fit %>%
mutate(pred_Yes_60 = factor(ifelse(.pred_Yes > .60,"Yes","No"),
levels=c("Yes","No")))
```
#### Confusion matrix for Classification Cutoff 60%
```{r, cache=TRUE}
pred_class_tree1_fit %>%
conf_mat(truth=SALES_PRICEhigh,estimate=pred_Yes_60)
```
#### Metrics for Classification Cutoff 60%
```{r, cache=TRUE}
curr_metrics <- pred_class_tree1_fit %>%
my_class_metrics(truth=SALES_PRICEhigh,estimate=pred_Yes_60)
results_new <- tibble(model = "Classification Tree Model 60% Cutoff",
Accuracy = curr_metrics[[1,3]],
Sensitivity = curr_metrics[[3,3]],
Specificity = curr_metrics[[2,3]])
results_cls <- bind_rows(results_cls, results_new)
results_cls %>%
kable(digits=2, align = 'l')
```
### Logistic Regression
For our final model, we will use logistic regression to explore the high sale price indicator. We can see that the numbers of rooms, bathrooms, and bedrooms, along with the interior square footage, are the most important variables in the model.
#### Logistic Regression Equation
```{r, cache=TRUE}
#Define the model specification
log_spec <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
#Fit the model
log_fit <- log_spec %>%
fit(SALES_PRICEhigh ~ ., data = df_class_norm)
tidy(log_fit$fit) %>%
kable(digits=2)
```
#### Pruned Logistic Regression Equation
```{r, cache=TRUE}
#Fit the model
log2_fit <- log_spec %>%
fit(SALES_PRICEhigh ~ .-AREA-DIST_MAINROAD-SALE_COND-UTILITY_AVAIL-PARK_FACIL-BUILDTYPE-PAVED, data = df_class_norm)
tidy(log2_fit$fit) %>%
kable(digits=2)
#Capture the predictions and metrics
pred_log2_fit <- augment(log2_fit,df_class_norm)
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)
curr_metrics <- pred_log2_fit %>%
my_class_metrics(truth=SALES_PRICEhigh,estimate=.pred_class)
results_new <- tibble(model = "Pruned Logistic Model",
Accuracy = curr_metrics[[1,3]],
Sensitivity = curr_metrics[[3,3]],
Specificity = curr_metrics[[2,3]])
results_cls <- bind_rows(results_cls, results_new)
log2_sens <- curr_metrics %>%
filter(.metric=='sens') %>%
pull(.estimate)
```
#### Examine the Confusion Matrix
```{r, cache=TRUE}
pred_log2_fit %>%
conf_mat(truth=SALES_PRICEhigh,estimate=.pred_class)
```
#### Variable Importance
Here we view the variable importance measures. The higher the value, the more important the variable.
```{r, cache=TRUE}
vip(log2_fit)
```
#### View the Metrics
```{r, cache=TRUE}
results_cls %>%
kable(digits = 2, align = 'l')
```
#### Checking the Cutoff
```{r, cache=TRUE}
ROC_threshold(pred_log2_fit, 'SALES_PRICEhigh', '.pred_Yes')
#Adding a new cutoff prediction column (52% cutoff)
pred_log2_fit <- pred_log2_fit %>%
mutate(pred_Yes_52 = factor(ifelse(.pred_Yes > .52,"Yes","No"),
levels=c("Yes","No")))
```
#### Confusion matrix for Logistic Cutoff 52%
```{r, cache=TRUE}
pred_log2_fit %>%
conf_mat(truth=SALES_PRICEhigh,estimate=pred_Yes_52)
```
#### Metrics for Logistic Cutoff 52%
```{r, cache=TRUE}
curr_metrics <- pred_log2_fit %>%
my_class_metrics(truth=SALES_PRICEhigh,estimate=pred_Yes_52)
results_new <- tibble(model = "Logistic Model 52% Cutoff",
Accuracy = curr_metrics[[1,3]],
Sensitivity = curr_metrics[[3,3]],
Specificity = curr_metrics[[2,3]])
results_cls <- bind_rows(results_cls, results_new)
results_cls %>%
kable(digits = 2, align = 'l')
```
Conclusion
=======================================================================
### Summary
In conclusion, we can see that our predictors do help to predict the house sale price, whether as the high/low indicator (with the cutoff at 10,000,000) or as the actual sale price.
Combining the results of both types of models and reporting only the variables where they agree, we can see that as these variables increase they:
### Predicting Continuous Home Sale Price
In addition, if we compare the models we examined for predicting the continuous home sale price, we see that the regression tree has the larger R-squared.
* Linear Regression RSQ: 0.89
* Tuned Regression Tree RSQ: 0.92
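For reference, the traditional R-squared is one minus the ratio of the residual sum of squares to the total sum of squares; a quick sketch with made-up values (note that yardstick's `rsq()` instead reports the squared correlation, with `rsq_trad()` giving the form below):

```{r}
# Sketch: traditional R-squared from made-up actual/predicted values
actual <- c(10, 12, 9, 15, 11)
pred   <- c(10.5, 11.5, 9.5, 14, 11)
rsq_trad <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
rsq_trad
```

A value near 1 means the predictions account for almost all the variation in the response.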
#### Summary Metrics Table
```{r, cache=TRUE}
results_reg %>%
kable(digits=2, align = 'l')
```
#### Actual vs Predicted Plot
```{r, cache=TRUE}
df_act_pred <- bind_rows(
pred_reg1_fit %>% mutate(model = 'Linear Model'),
pred_reg2_fit %>% mutate(model = 'Linear Final Model'),
pred_tree1_fit %>% mutate(model = 'Reg Tree Model'),
pred_tree2_fit %>% mutate(model = 'Tuned Reg Tree Model')
)
ggplotly(ggplot(df_act_pred, aes(y = .pred, x = SALES_PRICE, color=model)) +
geom_point() +
geom_abline(col = "gold") +
ggtitle("Predicted vs Actual Sale Price") )
```
### Predicting Categorical High/Low Sale Price
And if we compare the models we examined for predicting the categorical high sale price response, we see that the classification tree has the higher accuracy.
* Classification Tree (cutoff .60) Accuracy .92 Sensitivity .91
* Logistic Regression (cutoff .52) Accuracy .77 Sensitivity .75
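The accuracy and sensitivity figures above come from confusion-matrix counts; as a sketch with illustrative counts (not the fitted models' actual cells):

```{r}
# Sketch: metrics from a 2x2 confusion matrix (illustrative counts)
TP <- 92; FN <- 9   # truly "Yes" cases predicted Yes / No
TN <- 95; FP <- 4   # truly "No"  cases predicted No / Yes
sens <- TP / (TP + FN)                    # sensitivity (true positive rate)
spec <- TN / (TN + FP)                    # specificity (true negative rate)
acc  <- (TP + TN) / (TP + FP + TN + FN)   # overall accuracy
round(c(sensitivity = sens, specificity = spec, accuracy = acc), 2)
```

This is why shifting the cutoff trades sensitivity against specificity: raising it moves predictions from Yes to No, shrinking FP but growing FN.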
#### Summary Metrics Table
```{r, cache=TRUE}
results_cls %>%
kable(digits=2, align = 'l')
```
#### ROC Curves
```{r, cache=TRUE}
#Capture the auc
log_auc <- pred_log2_fit %>%
roc_auc(truth=SALES_PRICEhigh, .pred_Yes) %>%
pull(.estimate)
tree_auc <- pred_class_tree1_fit %>%
roc_auc(truth=SALES_PRICEhigh, .pred_Yes) %>%
pull(.estimate)
#Capture the thresholds and sens/spec
df_roc <- bind_rows(pred_log2_fit %>%
roc_curve(truth = SALES_PRICEhigh, .pred_Yes) %>%
mutate(model = paste('Logistic', round(log_auc,2))),
pred_class_tree1_fit %>%
roc_curve(truth = SALES_PRICEhigh, .pred_Yes) %>%
mutate(model = paste('Class Tree', round(tree_auc,2))))
#Create the ROC Curve(s)
ggplotly(ggplot(df_roc,
aes(x = 1 - specificity, y = sensitivity,
group = model, col = model)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "top"))
```